Retrieving AGNs from images only using pretrained features

Lars Doorenbos, Stefano Cavuoti, Olena Torbaniuk, Giuseppe Longo, Maurizio Paolillo, Massimo Brescia, Raphael Sznitman, Pablo Márquez-Neila

Introduction

We present a method that, given the image of an AGN host galaxy as input, outputs a list of semantically similar objects, which our experiments show are likely to be AGN hosts too. This is done by a similarity search in a feature space obtained by running all images through a neural network pretrained on ImageNet. The method is extremely fast, as it requires no finetuning of the network. Moreover, it makes no assumptions about the input or the dataset it is used on, making it generally applicable. This should make it an excellent tool for the initial data processing of large surveys.

Data

We solely make use of optical images, as this is by far the easiest modality to come by, in line with the main goal of our method: to be an exploratory tool for making a selection of interesting objects to investigate further.

Method

We represent each image as a 1280-dimensional feature vector by running it through a neural network pretrained on ImageNet, followed by channelwise average pooling of the feature maps in the penultimate layer. Each dimension focuses on some higher-level concept, e.g. a particular shape. We hypothesize that images which share many of the same image features, i.e. that are close together in this space, have the same astronomical properties.

Then, to retrieve AGNs, we select a number of known AGNs, and look for the closest objects in the dataset. This is done by a nearest neighbour search in the deep feature space. To the best of our knowledge, this is the first attempt at using deep pretrained features to retrieve astronomical objects of interest.

Result

For evaluation, we use a subset of the SDSS-based Brinchmann catalogue (https://www.sdss.org/dr12/spectro/galaxy_mpajhu/), which is richer in information, allowing us to identify low-luminosity AGNs in well-resolved galaxies. We define an AGN as an object marked as either "AGN" or "low S/N AGN" in the BPT diagram classification from the Brinchmann catalogue. At this stage we do not use the LSSTC AGN Data, as retrieved objects could be AGN without being labelled as such, and there would be no way of knowing it.

We will use the following two objects as reference objects in a running example (http://skyserver.sdss.org/dr16/en/tools/explore/summary.aspx?ra=51.3556883190224&dec=-6.14385552089765 and http://skyserver.sdss.org/dr16/en/tools/explore/summary.aspx?ra=234.088787717649&dec=22.4871197421943):

After obtaining their features, we find the nearest neighbours among the first 100k images of the cleaned Brinchmann catalogue. We display the 25 closest below:

As mentioned, based on a combination of the BPT diagram and X-ray diagnostics (see Torbaniuk et al. 2021), we know for these objects whether they are AGN or not. Hence, we can visualize the number of AGNs as a function of the distance to our reference object (in the feature space, not in terms of astronomical distance).

The graph below shows this for both objects. The data contains around 11% AGN, so that is the fraction we would expect when randomly selecting images from the dataset.
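The curve in such a graph can be computed as a cumulative AGN fraction over objects sorted by distance. A minimal sketch with illustrative stand-in data:

```python
# Purity as a function of distance: for the k nearest objects, the fraction that are AGN.
import numpy as np

rng = np.random.default_rng(0)
distances = rng.uniform(0, 10, size=1000)   # distance of each object to the reference (stand-in)
is_agn = rng.random(1000) < 0.11            # ~11% AGN, as in the cleaned catalogue (stand-in)

order = np.argsort(distances)
cum_fraction = np.cumsum(is_agn[order]) / np.arange(1, len(order) + 1)
# cum_fraction[k] is the AGN fraction among the k+1 nearest objects;
# for large k it converges to the dataset-wide fraction (~0.11).
```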

For the first object, suppose we set the threshold to 4.6. We would then select 232 objects, of which 42% are AGN.

For the second object, suppose we set the threshold to 3.25. We would then select 70 objects, of which 54% are AGN.
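The selection itself is a simple cut on the feature-space distance; purity is the AGN fraction among the selected objects. A sketch (inputs are illustrative stand-ins, the thresholds 4.6 and 3.25 come from the text):

```python
# Threshold-based selection and its purity.
import numpy as np

def select_by_threshold(distances, is_agn, threshold):
    """Return (number of selected objects, AGN fraction among them)."""
    mask = distances <= threshold
    n_selected = int(mask.sum())
    purity = float(is_agn[mask].mean()) if n_selected > 0 else float("nan")
    return n_selected, purity

rng = np.random.default_rng(1)
distances = rng.uniform(0, 10, size=1000)   # stand-in distances to a reference object
is_agn = rng.random(1000) < 0.11            # stand-in labels
n, purity = select_by_threshold(distances, is_agn, 4.6)
```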

Using the above thresholds, we can move to another part of the dataset to measure the method's performance there.

We move to the second 100k images of the cleaned Brinchmann catalogue and perform the same experiment with the chosen thresholds, finding very similar results.

For the first object, we would select 345 objects, of which 38% are AGN.

For the second object, we would select 81 objects, of which 54% are AGN.

Note that when all objects are included, the fraction correctly converges to the fraction of AGN in the whole dataset.

We now turn to the LSSTC AGN Data. Using the same images as reference objects, we again show the 25 nearest neighbours and find how many objects fall within our thresholds.

For object 1, there are 47 objects in this range, of which we would expect around 20 (±40%) to be AGN. This result still has to be verified against the label classes, which are unknown to us. Unfortunately, for object 2 there are no objects in this range, though we nonetheless believe its 25 nearest neighbours to be promising.

Note that the LSSTC AGN Data also contains a large number of stars, so the success rate might be somewhat lower, but so is the percentage of AGNs in the dataset. Results for more reference objects on the challenge data are given at the bottom of this notebook.

We tried using the small sample of labelled XMM data to verify our method on the challenge data, but all labelled AGNs are optically faint. For example, when using one of them as the reference object, 22 of the 25 closest objects are AGN, far better than the 42% present in the sample. However, we reckon this is due to the labelled AGN having mostly black frames, rather than a successful application of the method. For objects this faint, practically all information is stored in the central pixel, so using a method like ours makes little sense.

To evaluate whether our method performs better than just color selection, we try to quantify how much the morphological properties contribute to the final results, in addition to the photometry. For this purpose, we use only the i band (as opposed to gri), which should remove the dependence on color.
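One way to feed a single band into an RGB-pretrained network (this is a sketch of our assumption about the preprocessing, not a confirmed detail of the pipeline) is to replicate the i-band image across the three input channels, so the backbone can be reused unchanged while all colour information is removed by construction:

```python
# Single-band input for an RGB-pretrained network: replicate the band three times.
import numpy as np

def single_band_to_rgb(i_band):
    """i_band: (H, W) array -> (H, W, 3) pseudo-RGB with no colour signal."""
    return np.repeat(i_band[:, :, None], 3, axis=2)
```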

In eight out of ten cases we find that removing the color decreases performance by up to 10 percentage points, such as for our first reference object, where around 30% of close objects are now AGN:

In one case, however, the performance drops to only slightly above random guessing, such as for reference object two, indicating that the color (or perhaps the combination of morphology and color) is fully responsible for retrieving AGN:

Experiments on challenge data

Finally, we perform two experiments entirely on the challenge data.

Quasars

Randomly picking samples would result in 16.2% quasars. We test whether we can increase this by using four QSOs from the challenge data as reference objects.

Successful cases:

For the first object, our method also found many stars, which we reckon is a result of confusing bright blue stars in our Galaxy with quasars. Nonetheless, 40.4% of the 1000 nearest objects are labelled as QSO.

For the fainter second object, 53.6% of the 1000 nearest objects are labelled as QSO.

Unsuccessful cases:

For the first object, we think the method performs worse because it mostly picks up on the configuration of multiple objects in the thumbnail, rather than on the central object. We confirm this by showing the 25 nearest neighbours in the image below.

As for the second object, it looks like a distant galaxy rather than a ‘typical’ point-like blue quasar. We think our method found visually similar objects (galaxies), rather than quasars.

AGN

The fraction of AGN in the whole sample is only 1.13%. We test whether we can increase this by using five AGN with z < 0.2 (only 296 objects, around 0.07%, satisfy this) from the challenge data as reference objects. Among the 1000 closest objects, these references find 2.5%, 1.6%, 2.9%, 3.8%, and 2.3% AGN, respectively.

Throughout this challenge we always report the purity of the output, since completeness would depend, in the first place, on the completeness of the set of images selected for the investigation. Our main point is that, once LSST images become available, this method can be applied to any list of AGNs to retrieve similar objects with good purity in terms of being AGNs.

All features we use come from the thumbnail image alone, and they should be general image patterns, not hand-selected for our use case. We measure the degree to which each of 1280 patterns, like the ones shown in the "High-level feature" part of the image below, is present in an image. However, these patterns come from the model itself and were derived from natural images rather than astronomical ones. Hence, they do not necessarily have a logical interpretation for our use case.

In short, the features used are general, non-astronomical, high-level patterns that were not hand-crafted. Explaining these features is a whole field in itself, which we leave for future work.

(Image taken from https://medium.com/analytics-vidhya/the-world-through-the-eyes-of-cnn-5a52c034dbeb)

Note that our method resizes images to 224x224, despite the current thumbnails being 64x64. Hence, moving to higher-quality, higher-resolution images is a natural path for future improvement. Furthermore, explainable AI might provide new insights into which features are relevant for AGN detection. Finally, rather than measuring the distance to a single query, we could combine distances to multiple queries to obtain more robust results.
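The multi-query idea above could be implemented, for instance, by ranking objects on their mean distance to several reference AGNs rather than on the distance to one (a sketch under that assumption; other aggregations, e.g. the minimum, are equally plausible):

```python
# Combining distances to multiple queries for more robust retrieval.
import numpy as np

def multi_query_distances(features, queries):
    """features: (N, D), queries: (Q, D) -> (N,) mean Euclidean distance per object."""
    diffs = features[:, None, :] - queries[None, :, :]   # (N, Q, D) pairwise differences
    return np.linalg.norm(diffs, axis=2).mean(axis=1)    # average over the Q queries
```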

Code

1. Data processing

Read data details

Remove missing cutouts

Load data

Run through the pretrained neural network to obtain the feature vector for each image. Best run on a GPU.

Fit NearestNeighbors on dataset for fast searches

Our selected AGN queries

Function that finds the nearest neighbours for a query. Returns the distances and ids of the nearest neighbours.
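A sketch of such a helper (names are illustrative; the original notebook cell is not reproduced here), assuming a scikit-learn NearestNeighbors index fitted on the dataset features as in the step above:

```python
# Query helper: returns distances and ids of the k nearest neighbours.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def find_nearest(nn_index, query, k=25):
    """nn_index: fitted NearestNeighbors; query: (D,) feature vector."""
    distances, ids = nn_index.kneighbors(query.reshape(1, -1), n_neighbors=k)
    return distances[0], ids[0]
```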

In the following examples, the method has been applied to the challenge dataset. For the first 10 experiments we used reference objects selected from the Brinchmann catalogue, while the last two examples are extracted directly from the challenge dataset. The list just below each 5x5 mosaic represents the classification (according to the challenge labels) of the closest 1000 objects.